Predictive Checks + Final Project

Monday, June 5

Today we will…

  • Canvas Announcement
  • Cookie allergies?
  • Questions from Week 9: Linear Regression?
  • Data Intro + Cleaning: Feedback
  • New Material:
    • Predictive Checks
  • Final Project Work Time

Data Intro + Cleaning: Feedback

Your grade reflects the completeness of your submission, not the correctness!

  • Describe how the data were collected, if provided.
    • E.g., how do we know the average life expectancy?
  • Explain why you performed your chosen cleaning steps.
  • This assignment is to assess your learning in this course. You are expected to use functions and code techniques from this class.
    • You will lose points for using non-tidyverse functions to do tasks that we have discussed in this class!

Data Intro + Cleaning: Feedback

Code:

  • All code should be hidden – use echo: false.
  • Don’t name the R functions you have used (e.g., “We used str_detect to…”).
    • Instead, describe what you did in plain English.
  • Don’t use dataset or variable names in the text.
    • Say “We removed missing values from per capita GDP.” rather than “We removed NA from per_cap_gdp.”
  • Don’t print out the head of the data!

Data Intro + Cleaning: Feedback

Citations:

  • Cite your sources, including:
    • data sources.
    • descriptions of your variables that are not general knowledge.
  • You should have both in-line citations and a References section at the end of your report.

Data Intro + Cleaning: Feedback

Style + Organization:

  • Define all acronyms, especially any that are related to the variables of interest.
  • Everything should be in paragraph form – no bullets or numbered lists.
  • Read through your paper from top to bottom to make sure the organization makes sense.
    • At what point might someone get confused?

Predictive Checks

Any good analysis should include a check of the “adequacy of the fit of the model to the data and the plausibility of the model…” – Andrew Gelman

Predictive Checks

Predictive checks allow us to assess whether our fitted model would produce data similar to the data that we observed.

  • Yes? Our model is a good fit.
  • No? Our model is not a good fit.

This is an assessment of model fit.

Danger

Predictive checks are not intended to predict the response variable for new observations of the explanatory variable.

Recall: Linear Regression

For simple linear regression, we assume the responses can be modeled as a linear function of the explanatory variable and some error.

\[y = \beta_0 + \beta_1 x_1 + \varepsilon\]

We also assume that those errors \((\varepsilon)\) follow a normal distribution with mean 0 and standard deviation \(\sigma\).

\[\varepsilon \sim N(0, \sigma)\]

Recall: Linear Regression

Therefore, the data we would expect to come from this model can be generated by:

  1. predicting values from a fitted model (\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1\)) …

and

  2. … adding normally distributed errors.
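Written as a formula, one simulated response combines the two steps above. The symbols \(y^{sim}\) and \(e\) are notation introduced only for this sketch, and \(\hat{\sigma}\) stands for the residual standard error of the fitted model:

\[y^{sim} = \hat{y} + e = \hat{\beta}_0 + \hat{\beta}_1 x_1 + e, \qquad e \sim N(0, \hat{\sigma})\]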

Recall: Linear Regression

This method produces data that perfectly agree with the linear model conditions:

  • Linear relationship between \(x\) and \(y\).
  • Independence of observations.
  • Normality of residuals.
  • Equal variance of residuals.

Predictive Checks

If we compare data generated from the linear model to the observed data, we can determine how well the linear model fits the observed data.

  • Is it plausible that the observed data could be generated by the model?

The Process

To perform a predictive check…

  1. Fit a regression model to the observed data.

  2. For a set of explanatory values, obtain predicted response values from the model.

  3. Add random errors to the predictions.

  4. Compare the simulated data to the observed data.

  5. Iterate!

The Process

To perform a predictive check…

  1. Fit a regression model to the observed data.

Use the lm() function…
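A minimal sketch of this step, assuming the built-in `cars` data (stopping distance `dist` and speed `speed`) stands in for your own data. The object name `obs_fit` is only an illustration.

```r
# Assumption: the built-in `cars` data stands in for your project data.
# Fit a simple linear regression of stopping distance on speed.
obs_fit <- lm(dist ~ speed, data = cars)

coef(obs_fit)   # estimated intercept and slope
sigma(obs_fit)  # residual standard error (the estimate of sigma)
```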

The Process

To perform a predictive check…

  1. Fit a regression model to the observed data.

  2. For a set of explanatory values, obtain predicted response values from the model.

Use the predict() function…
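As a rough illustration, again assuming the built-in `cars` data and the hypothetical object names `obs_fit` and `preds`: calling `predict()` without new data returns one predicted response for each observed explanatory value.

```r
# Assumption: the built-in `cars` data stands in for your project data.
obs_fit <- lm(dist ~ speed, data = cars)

# With no newdata argument, predict() returns the fitted value
# for every row used to fit the model.
preds <- predict(obs_fit)

length(preds)  # one prediction per observation
```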

The Process

To perform a predictive check…

  1. Fit a regression model to the observed data.

  2. For a set of explanatory values, obtain predicted response values from the model.

  3. Add random errors to the predictions.

Use the rnorm() function…

The random errors have mean 0 and standard deviation estimated by the residual standard error (use sigma()).
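One way this step might look, assuming the built-in `cars` data. The names `obs_fit`, `preds`, and `sim_response`, and the seed value, are chosen only for this sketch.

```r
# Assumption: the built-in `cars` data stands in for your project data.
obs_fit <- lm(dist ~ speed, data = cars)
preds   <- predict(obs_fit)

set.seed(365)  # arbitrary seed so the simulation is reproducible

# Draw one normal error per prediction, with mean 0 and standard
# deviation equal to the residual standard error, and add it on.
sim_response <- preds + rnorm(length(preds), mean = 0, sd = sigma(obs_fit))

head(sim_response)
```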

The Process

To perform a predictive check…

  1. Fit a regression model to the observed data.

  2. For a set of explanatory values, obtain predicted response values from the model.

  3. Add random errors to the predictions.

  4. Compare the simulated data to the observed data.

Use the lm() function to regress observed on simulated…

To measure similarity, record \(R^2\) (here, the proportion of variability in the observed responses explained by a linear relationship with the simulated responses).
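A sketch of the comparison, assuming the built-in `cars` data and hypothetical object names. The observed response is regressed on one simulated response, and the \(R^2\) is pulled from the model summary.

```r
# Assumption: the built-in `cars` data stands in for your project data.
obs_fit <- lm(dist ~ speed, data = cars)

set.seed(365)  # arbitrary seed for reproducibility
sim_response <- predict(obs_fit) +
  rnorm(nrow(cars), mean = 0, sd = sigma(obs_fit))

# Regress the observed response on the simulated response.
sim_check <- lm(cars$dist ~ sim_response)

# R^2 of this regression measures how similar the two are.
sim_r2 <- summary(sim_check)$r.squared
sim_r2
```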

The Process

To perform a predictive check…

  1. Fit a regression model to the observed data.

  2. For a set of explanatory values, obtain predicted response values from the model.

  3. Add random errors to the predictions.

  4. Compare the simulated data to the observed data.

  5. Iterate!

Use the map() function to repeat the process over and over…

We want to see how the model performs across many simulated datasets.

  • Compute the \(R^2\) for each.

Instead of \(R^2\), you could use correlation \((r)\), sum of squared errors \((SSE)\), or the estimate of \(\sigma\) \((RMSE)\) to measure similarity.
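The iteration might be sketched as follows, assuming the built-in `cars` data and 1000 simulated datasets. The helper `simulate_r2()` is hypothetical (written for this example), and `map_dbl()` is the purrr variant of `map()` that returns a numeric vector.

```r
library(purrr)

# Assumption: the built-in `cars` data stands in for your project data.
obs_fit <- lm(dist ~ speed, data = cars)

# Hypothetical helper: simulate one dataset from the fitted model and
# return the R^2 from regressing the observed response on it.
simulate_r2 <- function(fit, observed) {
  simulated <- predict(fit) +
    rnorm(length(observed), mean = 0, sd = sigma(fit))
  summary(lm(observed ~ simulated))$r.squared
}

set.seed(365)  # arbitrary seed for reproducibility

# Repeat the simulation 1000 times, keeping one R^2 per dataset.
sim_r2 <- map_dbl(1:1000, ~ simulate_r2(obs_fit, cars$dist))

summary(sim_r2)
```

Swapping in a different similarity measure only requires changing the last line of the helper.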

Distribution of Simulated \(R^2\)

Plot the distribution of simulated \(R^2\) values to see how well the model performs.

  • Values distributed near 1 indicate a good fit!
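A possible version of this plot, assuming the built-in `cars` data, 1000 simulated datasets, and hypothetical object names. The simulated \(R^2\) values are regenerated here so the example runs on its own.

```r
library(purrr)
library(ggplot2)

# Assumption: the built-in `cars` data stands in for your project data.
obs_fit <- lm(dist ~ speed, data = cars)

set.seed(365)  # arbitrary seed for reproducibility

# One R^2 per simulated dataset (observed regressed on simulated).
sim_r2 <- map_dbl(1:1000, ~ {
  simulated <- predict(obs_fit) +
    rnorm(nrow(cars), mean = 0, sd = sigma(obs_fit))
  summary(lm(cars$dist ~ simulated))$r.squared
})

# Histogram of the simulated R^2 values.
r2_plot <- ggplot(data.frame(r2 = sim_r2), aes(x = r2)) +
  geom_histogram(bins = 30) +
  labs(x = "Simulated R-squared",
       y = "Number of simulated datasets")

r2_plot
```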

For your project…

For your group project, you will run predictive checks to assess how well your model performs.

  • This is Section 3 of the Project Details page.

To do…

  • Course Evaluation
    • Closes Friday, 6/9 at 11:59pm.
  • Final Project Report
    • Due Monday, 6/12 at 11:59pm.
  • Final Exam
    • Friday, 6/16 from 10:10-1:00 for the 10:10am section.
    • Wednesday, 6/14 from 10:10-1:00 for the 12:10pm section.

Wednesday, June 7

Today we will…

  • Linear Regression: Feedback
  • Final Exam: What to Expect
  • Remaining Q & A
  • R Hex Cookies!
  • Final Project Work Time

Linear Regression: Feedback

  • Think about the readability of the numbers you are presenting.

    • Do you need 6 decimal places?
    • Is scientific notation easily understood by the public?
  • Include units on your plots!

  • If you do any transformations, make sure you mention them.

    • Also make sure they are clear on any plots!

Linear Regression: Feedback

  • When you present a plot or a table, discuss in words what you want the reader to take away from it.
    • Discuss the table of variances as part of your discussion of model fit.
  • If you are modeling the average across years (or one particular year), make sure you include a plot of the average (or that year) in addition to the full data.

Linear Regression: Feedback

  • Some of you used a ratio of the response to the explanatory variable to show the relationship over time.
    • This is often not easy to understand or interpret.
    • I encourage you to find a clearer way to display this relationship over time.
    • If you choose to go this route, you will need lots of clear explanation about what ratio you are calculating and what it means.

Final Exam: What to Expect

  • The exam is worth a total of 100 points and has 3 parts: General Questions, Short Answer, and Statistical Modeling.
  • You will have 2 hours and 50 minutes for the entire exam.
  • You will complete Part 1: General Questions first.
    • This part is closed note and closed computer.
  • You can complete the questions in Part 2: Short Answer and Part 3: Statistical Modeling in any order.
    • A .qmd starter file will be opened at the start of each final.
    • I will pass out paper copies of the questions.

Final Exam: What to Expect

The exam is cumulative and will definitely contain questions on:

  • Data manipulations with dplyr and tidyr.
  • Data visualizations with ggplot2.
  • Function writing.
  • Functional programming with map.
  • Statistical modeling with lm.

Q & A

Cookies!

To do…

  • Course Evaluation
    • Closes Friday, 6/9 at 11:59pm.
  • Final Project Report
    • Due Monday, 6/12 at 11:59pm.
  • Final Exam
    • 10:10-12:00 Section: Friday, 6/16 from 10:10am - 1:00pm.
    • 12:10-2:00 Section: Wednesday, 6/14 from 10:10am - 1:00pm.